Classic Asynchronous Workflow Example

This section describes the most common usage of the asynchronous endpoints.

The purpose of this flow is:

  1. Create a collection with multiple requests.
  2. Execute a Run for this collection.
  3. Query the run status until it completes.

1. Create a collection

First, we create a collection that groups the requests we want to run asynchronously.

Endpoint

POST /v1/async/collections

Request Body Example

{
  "name": "new collection",
  "requests": [
    {
      "url": "www.google.com",
      "browser": true,
      "screenshot": false,
      "actions": [
        {
          "type": "wait-for-timeout",
          "time": 5000
        }
      ]
    },
    {
      "url": "www.example.com",
      "browser": true,
      "screenshot": false,
      "actions": [
        {
          "type": "wait-for-timeout",
          "time": 5000
        }
      ]
    }
  ]
}

Expected Response

{
  "id": "c38b0bcf-cb7c-4728-8704-2c2e267dcff9",
  "name": "new collection",
  "message": "Collection created successfully."
}

At this point, the collection is ready to be executed. Save the id value from the response: it is the collection_id required in the following steps.
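A minimal Python sketch of this step, under stated assumptions: build_collection_payload and create_collection are hypothetical helper names, and post is any callable shaped like requests.post (it accepts headers= and json=, and returns an object with raise_for_status() and json()). The base URL and authentication headers are placeholders you must supply.

```python
from typing import Any, Callable, Dict, List


def build_collection_payload(name: str, urls: List[str],
                             wait_ms: int = 5000) -> Dict[str, Any]:
    """Build a request body shaped like the example above (hypothetical helper)."""
    return {
        "name": name,
        "requests": [
            {
                "url": url,
                "browser": True,
                "screenshot": False,
                "actions": [{"type": "wait-for-timeout", "time": wait_ms}],
            }
            for url in urls
        ],
    }


def create_collection(post: Callable[..., Any], base_url: str,
                      headers: Dict[str, str], payload: Dict[str, Any]) -> str:
    """POST the payload and return the new collection's id.

    `post` is assumed to behave like `requests.post`.
    """
    resp = post(f"{base_url}/v1/async/collections", headers=headers, json=payload)
    resp.raise_for_status()
    # The response field is named "id"; later endpoints refer to it as collection_id.
    return resp.json()["id"]
```

With the requests library installed, a call could look like create_collection(requests.post, BASE_URL, HEADERS, build_collection_payload("new collection", ["www.google.com", "www.example.com"])).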


2. Create a Run for the Collection

Once the collection has been created, we can start the Run execution.

A Run represents a single execution of the requests placed in the collection.

Endpoint

POST /v1/async/collections/{collection_id}/run

Parameters

No request body is needed. The only required parameter is the collection_id in the path.

Response Example

{
  "run_id": "9b64941a-4545-4c57-9174-c70e781d9192",
  "status": "in_progress",
  "total_requests": 2,
  "success_requests": 0,
  "failed_requests": 0,
  "timeout_requests": 0,
  "collection_id": "c38b0bcf-cb7c-4728-8704-2c2e267dcff9"
}

The run is created and begins executing asynchronously.

  • The initial status is always in_progress.
  • The run_id uniquely identifies the execution and should be saved to track the Run.

3. Query the Run Status

Since the run is asynchronous, the execution takes a variable amount of time depending on the number of requests and their complexity.

After a short wait, you can query the run status using the run_id.

Endpoint

GET /v1/async/collections/{collection_id}/runs/{run_id}

Response Example (Still In Progress)

{
  "run_id": "9b64941a-4545-4c57-9174-c70e781d9192",
  "status": "in_progress",
  "total_requests": 2,
  "success_requests": 1,
  "failed_requests": 0,
  "timeout_requests": 0,
  "collection_id": "c38b0bcf-cb7c-4728-8704-2c2e267dcff9"
}

This response indicates that the Run has started but has not finished: one of the two requests has succeeded so far, and the status is still in_progress.
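The polling described in this step can be sketched as a small loop. This is an illustration, not an official client: wait_for_run is a hypothetical name, fetch_status is a caller-supplied function that performs the GET above and returns the parsed JSON, and the interval and maximum wait are arbitrary defaults rather than documented values.

```python
import time
from typing import Any, Callable, Dict


def wait_for_run(fetch_status: Callable[[], Dict[str, Any]],
                 poll_interval: float = 5.0,
                 max_wait: float = 600.0,
                 sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Poll until the run leaves "in_progress" or max_wait seconds of sleeping elapse."""
    waited = 0.0
    while True:
        run = fetch_status()
        if run["status"] != "in_progress":
            return run  # the run reached a final status
        if waited >= max_wait:
            raise TimeoutError(
                f"run {run['run_id']} still in progress after {waited:.0f}s"
            )
        sleep(poll_interval)
        waited += poll_interval
```

With the requests library, fetch_status could be lambda: requests.get(f"{BASE_URL}/v1/async/collections/{collection_id}/runs/{run_id}", headers=HEADERS).json(). The sleep parameter exists only so the loop can be exercised without real delays.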


4. Run Completed

After waiting long enough, the query will return the completed run.

Response Example (Completed)

{
  "run_id": "9b64941a-4545-4c57-9174-c70e781d9192",
  "status": "completed",
  "total_requests": 2,
  "success_requests": 2,
  "failed_requests": 0,
  "timeout_requests": 0,
  "collection_id": "c38b0bcf-cb7c-4728-8704-2c2e267dcff9"
}

At this point:

  • status is completed.
  • All requests defined in the collection have been processed.
  • success_requests counts jobs that returned usable content (HTTP 2xx and no captcha or block signal).
  • failed_requests includes worker failures and jobs that completed but whose target returned 4xx/5xx or a block page.
  • timeout_requests covers jobs that exceeded the worker-level timeout.
  • On a completed run, total_requests = success_requests + failed_requests + timeout_requests always holds.
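As a sanity check on the counters, a client can verify that they add up before processing results. The sketch below assumes only the field names shown in the response examples; summarize_run is a hypothetical helper name.

```python
from typing import Any, Dict


def summarize_run(run: Dict[str, Any]) -> Dict[str, Any]:
    """Check that the counters add up on a completed run and compute a success rate."""
    if run["status"] != "completed":
        raise ValueError("counters are only guaranteed to add up on a completed run")
    accounted = (
        run["success_requests"] + run["failed_requests"] + run["timeout_requests"]
    )
    total = run["total_requests"]
    if accounted != total:
        raise ValueError(f"counter mismatch: {accounted} accounted, {total} total")
    return {
        "all_succeeded": run["success_requests"] == total,
        "success_rate": run["success_requests"] / total if total else 0.0,
    }
```

Applied to the completed response above, this reports that all requests succeeded.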

Retrieving per-job results

The run-status endpoint above returns only a summary. To iterate over each job's URL, custom_id, timings, and HTML, use the jobs listing endpoint with cursor pagination:

import requests

# BASE_URL, HEADERS, COLLECTION_ID and run_id are assumed to be defined by the caller;
# handle() stands for your own per-job callback.
cursor = None
while True:
    params = {"limit": 500, "order_by": "completed_at", "status_filter": "completed,failed,timeout"}
    if cursor:
        params["cursor"] = cursor  # continue from the end of the previous page
    page = requests.get(
        f"{BASE_URL}/v1/async/collections/{COLLECTION_ID}/runs/{run_id}/jobs",
        headers=HEADERS, params=params,
    ).json()
    for job in page["items"]:
        handle(job["custom_id"], job["url"], job["status"], job["status_code"])
    if not page.get("has_more"):
        break  # last page reached
    cursor = page["cursor_next"]

Combining order_by=completed_at with since_completed_at lets you stream completions incrementally, without re-paginating the whole run on each poll. See the API reference for the full semantics.

To fetch the full HTML or extracted data of a specific job, call GET /v1/async/collections/{collection_id}/runs/{run_id}/jobs/{job_id}/result. HTML bodies are retained for 48 hours after completion; metadata (status, timings, URL, custom_id) remains available in the listing endpoint for 90 days.


Summary

This flow follows a simple three-part pattern:

  1. Create a Collection: with one or more requests (each may include an optional custom_id for traceability).
  2. Start the Execution: run the collection asynchronously.
  3. Query the Run Status: poll using the run ID until it completes, then iterate the jobs listing to process each result.

Two small additions make the workflow safe against transient failures:

Generate an Idempotency-Key per submit

Pass a UUID in the Idempotency-Key header on POST /v1/async/collections. If the response is lost to a network timeout, a retry with the same key within 24 h returns the original collection without creating a duplicate (and without a second charge):

import uuid
import requests

key = str(uuid.uuid4())  # generate once and reuse for every retry of this submit
resp = requests.post(
    f"{BASE_URL}/v1/async/collections",
    headers={**HEADERS, "Idempotency-Key": key},
    json={"name": "daily-2026-04-30", "requests": [...]},
)
# Safe to retry this POST on timeout: same key + same body returns the same collection.
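The retry itself can be sketched as a loop that generates the key once and reuses it on every attempt. This is an illustration under assumptions: create_collection_with_retry is a hypothetical helper, post is any callable shaped like requests.post, and the attempt count and backoff are arbitrary defaults rather than documented values.

```python
import time
import uuid
from typing import Any, Callable, Dict, Tuple, Type


def create_collection_with_retry(
    post: Callable[..., Any],
    base_url: str,
    headers: Dict[str, str],
    payload: Dict[str, Any],
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    attempts: int = 3,
    backoff: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Submit a collection, retrying transient failures with one Idempotency-Key."""
    key = str(uuid.uuid4())  # generated once, so every retry carries the same key
    request_headers = {**headers, "Idempotency-Key": key}
    for attempt in range(attempts):
        try:
            resp = post(f"{base_url}/v1/async/collections",
                        headers=request_headers, json=payload)
            resp.raise_for_status()
            return resp.json()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff * 2 ** attempt)  # exponential backoff between attempts
    raise ValueError("attempts must be at least 1")
```

When using the requests library, pass retry_on=(requests.Timeout, requests.ConnectionError), since those exceptions do not derive from the built-in TimeoutError and ConnectionError. The payload must stay identical across attempts, because the guarantee described above applies to the same key with the same body.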

Reattach to a live run via GET /v1/async/collections/{collection_id}/runs

If the response to POST /run is lost, do not retry the POST, which would queue a duplicate run. The run_id has already been created server-side, so look it up by collection:

runs = requests.get(
    f"{BASE_URL}/v1/async/collections/{collection_id}/runs?status_filter=in_progress",
    headers=HEADERS,
).json()
if runs["total"] > 0:
    run_id = runs["items"][0]["run_id"]  # reattach to the run that is already executing
else:
    # No in-progress run found for this collection: start one.
    run_id = requests.post(
        f"{BASE_URL}/v1/async/collections/{collection_id}/run",
        headers=HEADERS,
    ).json()["run_id"]

These two patterns combined remove the most common production failure mode: doubled batches caused by client-side retry of an already-successful request.